Quality of Red Wine by TU YEMEI

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Univariate Plots Section

First, here is an overview of the distribution of wine quality:

Next, I look at the distributions of the other chemical properties.

I make some adjustments to the plots of the chemical properties.
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## $x
## [1] "chlorides"
## 
## attr(,"class")
## [1] "labels"
## Warning: Removed 79 rows containing non-finite values (stat_bin).

Free sulfur dioxide and total sulfur dioxide are both positively skewed.
## Warning: Removed 77 rows containing non-finite values (stat_bin).
## Warning: Removed 80 rows containing non-finite values (stat_bin).

Univariate Analysis

What is the structure of your dataset?

13 variables, 1,599 observations.

What is/are the main feature(s) of interest in your dataset?

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think the three acidity variables and the two sulfur dioxide variables could each be combined into a single variable and used together with the other chemical properties.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Some variables have long tails, so I limited the x-axis to the 95th percentile in order to exclude some outliers.
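
The 95% limit can be sketched in base R as follows. This is a minimal illustration rather than the code behind the plots: the `wine_demo` data frame is synthetic, and the names `upper_limit` and `trimmed` are introduced here only for the example.

```r
# Synthetic, right-skewed stand-in for a long-tailed variable such as chlorides
# (assumption: the real analysis uses the wine data frame instead).
set.seed(1)
wine_demo <- data.frame(chlorides = rlnorm(1000, meanlog = -2.5, sdlog = 0.4))

# 95th percentile of the variable, used as the upper x-axis limit
upper_limit <- quantile(wine_demo$chlorides, probs = 0.95, names = FALSE)

# Keep only values at or below the limit; the top 5% are treated as outliers
trimmed <- wine_demo$chlorides[wine_demo$chlorides <= upper_limit]

# Histogram of the trimmed values
hist(trimmed, breaks = 30, main = "chlorides (bottom 95%)", xlab = "chlorides")
```

With ggplot2, setting axis limits through a scale (for example `xlim()`) drops out-of-range rows before binning, which would explain the "Removed ... rows containing non-finite values (stat_bin)" warnings. 5% of 1,599 rows is about 80, in line with the 77 to 80 rows reported.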

Bivariate Plots Section

I want to add some lines to show the general trend.

In this plot, we can observe that alcohol increases with quality.

Labeling quality with different levels & boxplot

I want to label wine quality as a set of discrete categories so that I can use box plots to explore it.

I repeat the same analysis for the other three factors and explore the plots.

Negative relationship between quality and volatile acidity.

Positive relationship.

It is clear that higher-quality wine has a higher alcohol content.
Since there are six quality levels, and the highest and lowest levels contain too few wines to be seen clearly, I combine the levels in pairs to form three wider levels: Low, Median, High.
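
A minimal sketch of this regrouping with base R's `cut()`. The exact pairing (3-4 as Low, 5-6 as Median, 7-8 as High) is an assumption based on the six observed levels; the `quality` vector below is made-up example data.

```r
# Example quality scores covering the six observed levels (3 to 8)
quality <- c(3, 4, 5, 5, 6, 6, 7, 8)

# Assumed grouping: 3-4 -> Low, 5-6 -> Median, 7-8 -> High.
# cut() uses right-closed intervals by default: (2,4], (4,6], (6,8]
quality_level <- cut(quality,
                     breaks = c(2, 4, 6, 8),
                     labels = c("Low", "Median", "High"))

# Number of wines in each of the three wider levels
table(quality_level)
```

The resulting factor can then be used as the grouping variable for box plots or density plots.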
## Warning: Ignoring unknown parameters: binwidth

In this way, we get a clearer overview of the alcohol distribution, and we can apply the same analysis to the other three factors.
## Warning: Ignoring unknown parameters: binwidth

A density plot works better than a histogram for volatile acidity: from the upper plot we can only conclude that most wines lie around 0.5 on the x-axis, whereas the lower plot shows that higher quality goes with lower volatile acidity.
## Warning: Ignoring unknown parameters: binwidth

Higher quality is associated with higher citric acid.
## Warning: Ignoring unknown parameters: binwidth

This distribution has a long tail, so I omit the top 1% of sulphates values.
## Warning: Ignoring unknown parameters: binwidth

Explore two main variables

## `geom_smooth()` using method = 'gam'

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
First, I calculated the Pearson correlation between the variables, focusing especially on their relationships with quality. I then explored the four factors most strongly correlated with quality. It turns out that volatile acidity has a negative influence, while citric acid and alcohol have a positive influence on quality.
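
The correlation step can be sketched as follows. This is an illustration on synthetic data, not the original computation: `wine_demo` and its coefficients are made up so that quality rises with alcohol and falls with volatile acidity.

```r
# Synthetic stand-in for the wine data (assumption: only two predictors shown)
set.seed(42)
n <- 200
alcohol <- rnorm(n, mean = 10.4, sd = 1)
volatile.acidity <- rnorm(n, mean = 0.53, sd = 0.18)
quality <- 5.6 + 0.3 * (alcohol - 10.4) -
  1.1 * (volatile.acidity - 0.53) + rnorm(n, sd = 0.5)
wine_demo <- data.frame(alcohol, volatile.acidity, quality)

# Pearson correlation of every predictor with quality
predictors <- setdiff(names(wine_demo), "quality")
cors <- sapply(wine_demo[predictors], cor,
               y = wine_demo$quality, method = "pearson")

# Rank by absolute correlation, strongest first
cors[order(abs(cors), decreasing = TRUE)]
```

Ranking by absolute value keeps strong negative correlations, such as that of volatile acidity, near the top of the list.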

Multivariate Plots Section

We already know that alcohol contributes a lot to the quality of wine. Now I want to bring in other variables to see whether they contribute to quality in other ways.
## Warning: Removed 61 rows containing missing values (geom_point).

Taking quality level 3 as an example, density ranges from 0.9925 to 1.0000, which suggests that density contributes little to the quality level. The plot also shows that alcohol has a weak negative relationship with density.
## Warning: Removed 48 rows containing non-finite values (stat_smooth).
## Warning: Removed 64 rows containing missing values (geom_point).

Since the plot is hard to read, I use the facet_wrap function to split it by quality level. From the plot, we can infer that the low quality levels (1-3) lie around 0.5 on the y-axis, while quality levels 4-6 tend to have higher values on both the y-axis and the x-axis. Thus higher-quality wines have higher alcohol and sulphates.

The influence of pH is not obvious.
## Warning: Removed 103 rows containing non-finite values (stat_smooth).
## Warning: Removed 103 rows containing missing values (geom_point).

Every quality level spans a wide range on the y-axis, which suggests that residual sugar has little influence on quality.
## Warning: Removed 64 rows containing non-finite values (stat_smooth).
## Warning: Removed 94 rows containing missing values (geom_point).
## Warning: Removed 5 rows containing missing values (geom_smooth).

Lower total sulfur dioxide and higher alcohol are associated with higher quality.
## Warning: Removed 86 rows containing non-finite values (stat_smooth).
## Warning: Removed 108 rows containing missing values (geom_point).

No obvious correlation.

Linear analysis:

Predict the wine quality based on chemical properties
## 
## Calls:
## m1: lm(formula = quality ~ volatile.acidity, data = wine)
## m2: lm(formula = quality ~ volatile.acidity + alcohol, data = wine)
## m3: lm(formula = quality ~ volatile.acidity + alcohol + sulphates, 
##     data = wine)
## m4: lm(formula = quality ~ volatile.acidity + alcohol + sulphates + 
##     citric.acid, data = wine)
## m5: lm(formula = quality ~ volatile.acidity + alcohol + sulphates + 
##     citric.acid + chlorides, data = wine)
## m6: lm(formula = quality ~ volatile.acidity + alcohol + sulphates + 
##     citric.acid + chlorides + total.sulfur.dioxide, data = wine)
## m7: lm(formula = quality ~ volatile.acidity + alcohol + sulphates + 
##     citric.acid + chlorides + total.sulfur.dioxide + density, 
##     data = wine)
## 
## ==========================================================================================================================
##                              m1            m2            m3            m4            m5            m6            m7       
## --------------------------------------------------------------------------------------------------------------------------
##   (Intercept)               6.566***      3.095***      2.611***      2.646***      2.769***      2.985***     -0.953     
##                            (0.058)       (0.184)       (0.196)       (0.201)       (0.202)       (0.206)      (11.990)    
##   volatile.acidity         -1.761***     -1.384***     -1.221***     -1.265***     -1.155***     -1.104***     -1.114***  
##                            (0.104)       (0.095)       (0.097)       (0.113)       (0.115)       (0.115)       (0.120)    
##   alcohol                                 0.314***      0.309***      0.309***      0.292***      0.276***      0.280***  
##                                          (0.016)       (0.016)       (0.016)       (0.016)       (0.017)       (0.020)    
##   sulphates                                             0.679***      0.696***      0.871***      0.908***      0.903***  
##                                                        (0.101)       (0.103)       (0.111)       (0.111)       (0.112)    
##   citric.acid                                                        -0.079         0.021         0.065         0.044     
##                                                                      (0.104)       (0.106)       (0.106)       (0.124)    
##   chlorides                                                                        -1.663***     -1.763***     -1.747***  
##                                                                                    (0.405)       (0.403)       (0.406)    
##   total.sulfur.dioxide                                                                           -0.002***     -0.002***  
##                                                                                                  (0.001)       (0.001)    
##   density                                                                                                       3.923     
##                                                                                                               (11.944)    
## --------------------------------------------------------------------------------------------------------------------------
##   R-squared                 0.153         0.317         0.336         0.336         0.343         0.352         0.352     
##   adj. R-squared            0.152         0.316         0.335         0.334         0.341         0.349         0.349     
##   sigma                     0.744         0.668         0.659         0.659         0.656         0.651         0.652     
##   F                       287.444       370.379       268.912       201.777       166.407       143.910       123.298     
##   p                         0.000         0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood        -1794.312     -1621.814     -1599.384     -1599.093     -1590.662     -1580.192     -1580.138     
##   Deviance                883.198       711.796       692.105       691.852       684.595       675.689       675.643     
##   AIC                    3594.624      3251.628      3208.768      3210.186      3195.324      3176.384      3178.276     
##   BIC                    3610.756      3273.136      3235.654      3242.448      3232.964      3219.401      3226.670     
##   N                      1599          1599          1599          1599          1599          1599          1599         
## ==========================================================================================================================
The model (m6) can be described as: wine_quality = 2.985 + 0.276 * alcohol - 1.104 * volatile.acidity + 0.908 * sulphates + 0.065 * citric.acid - 1.763 * chlorides - 0.002 * total.sulfur.dioxide
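
As a worked example, the m6 column of the regression table can be turned into a small prediction helper. This is a sketch: `predict_quality` is a hypothetical helper name, and `example_wine` is an illustrative wine whose values are the dataset medians from the summary above.

```r
# Coefficients of model m6 as reported in the regression table
m6_coef <- c(
  "(Intercept)"          =  2.985,
  "volatile.acidity"     = -1.104,
  "alcohol"              =  0.276,
  "sulphates"            =  0.908,
  "citric.acid"          =  0.065,
  "chlorides"            = -1.763,
  "total.sulfur.dioxide" = -0.002
)

# Hypothetical helper: predicted quality for one wine (named numeric vector)
predict_quality <- function(wine, coef = m6_coef) {
  predictors <- setdiff(names(coef), "(Intercept)")
  unname(coef["(Intercept)"] + sum(coef[predictors] * wine[predictors]))
}

# Illustrative wine built from the median of each variable
example_wine <- c(
  volatile.acidity = 0.52, alcohol = 10.2, sulphates = 0.62,
  citric.acid = 0.26, chlorides = 0.079, total.sulfur.dioxide = 38
)

predict_quality(example_wine)  # about 5.59
```

The prediction for a median wine lands close to the mean quality of 5.636, as expected for a linear model evaluated near the center of the data.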

Final Plots and Summary

Plot One

Description One

Most of the data lies in quality levels 5-7, and alcohol has an obvious positive influence on quality: the better the quality, the higher the alcohol percentage. The line clearly shows the trend. However, the linear model shows that although alcohol plays an important role, its coefficient is only about 0.28, so it is not the only factor determining wine quality.

Plot Two

Description Two

Plot Three

## Warning: Removed 734 rows containing non-finite values (stat_smooth).
## Warning: Removed 808 rows containing missing values (geom_point).
## Warning: Removed 455 rows containing non-finite values (stat_smooth).
## Warning: Removed 474 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_smooth).

Description Three

In general, high-quality wines tend to have higher alcohol and lower volatile acidity content. They also tend to have higher sulphates and higher citric acid content.

Reflection

The red wine dataset contains 1,599 observations with 11 variables on chemical properties. I focused on the correlation between chemical properties and quality and explored which variables have the most influence on quality. Furthermore, in the multivariate analysis I also examined the correlations among the chemical properties themselves, in addition to their correlations with quality. Finally, I built linear models to quantify these influences.
However, the other chemical properties show only weak correlation with quality, in both the visualizations and the statistical calculations. Wine quality is a complex problem influenced by many factors, so the linear model I used is an oversimplification.
In my opinion, the variables in this dataset are not well suited to this analysis, since only four factors proved to be correlated with quality. I propose adding further variables for future analysis, such as place of production, temperature, water percentage, environment, and year. In addition, most of the data falls in quality levels 5-6, with few low-quality and high-quality wines; more data at those levels is needed for deeper analysis.